# Real-time speech processing

Ultravox v0.5 Llama 3.2 1B

A multilingual audio-and-text-to-text model preloaded with meta-llama/Llama-3.2-1B-Instruct weights

Large Language Model

Transformers, Supports Multiple Languages

Ultravox v0.5 Llama 3.2 1B ONNX

Ultravox is a multilingual audio-to-text model built on the Llama-3.2-1B architecture, supporting speech recognition and transcription in multiple languages.

Transformers, Supports Multiple Languages

Segmentation 3.0

This is an audio segmentation model capable of detecting speaker changes, voice activity, and overlapping speech, suitable for audio analysis in multi-speaker scenarios.

Audio Processing

A fine-tuned Uzbek speech recognition model based on Oyqiz/uzbek_stt, specifically optimized for legal and military domain data

Speech Recognition

Transformers, Other

Segmentation 3.0

This is a speaker segmentation model based on pyannote.audio, capable of detecting speech activity, speaker changes, and overlapping speech.

Audio Processing

Speaker Diarization 3.0

Speaker diarization pipeline built with pyannote.audio 3.0.0, supporting automatic voice activity detection, speaker change detection, and overlapping speech detection

Speaker Analysis

Wav2vec Fine Tuned Speech Command2

A speech command recognition model based on facebook/wav2vec2-base and fine-tuned on the speech_commands dataset, achieving 97.35% accuracy

Audio Classification

Speechcommand Demo

A voice command classification model fine-tuned from facebook/wav2vec2-base on the superb dataset, reaching 98.09% accuracy

Audio Classification

S2t Small Mustc En Nl St

An end-to-end speech translation model based on S2T architecture, specifically designed for English-to-Dutch speech translation tasks

Speech Recognition

Transformers, Supports Multiple Languages

Wav2vec2 Large Xlsr 53 Greek

This is a Greek automatic speech recognition model based on the XLSR-Wav2Vec2 architecture, developed by the Hellenic Military Academy and the Technical University of Crete.

Speech Recognition, Other

Sepformer Wham Enhancement

A SepFormer model for speech enhancement (denoising), pre-trained on the WHAM! dataset (8 kHz sampling rate version) to remove environmental noise and reverberation.

Audio Enhancement, English

Sepformer Whamr Enhancement

A SepFormer model for speech enhancement (denoising + dereverberation), pre-trained on the WHAMR! dataset (8 kHz), with a test-set SI-SNR of 10.59 dB.

Audio Enhancement, English
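
The SI-SNR figure quoted for this model is a scale-invariant signal-to-noise ratio, measured in dB between an enhanced signal and its clean reference. The snippet below is a minimal sketch of how the metric is commonly defined, written in plain Python for illustration. The function name `si_snr` and the `eps` stabilizer are assumptions made here, not part of the model card, and the evaluation code behind the reported number may differ in detail.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sample sequences.

    Illustrative helper: the name and the eps default are assumptions.
    """
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    n = len(target)

    # Remove the mean so a constant offset does not affect the score.
    est_mean = sum(estimate) / n
    tgt_mean = sum(target) / n
    est = [x - est_mean for x in estimate]
    tgt = [x - tgt_mean for x in target]

    # Project the estimate onto the target to get the scaled target component.
    dot = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt) + eps
    scale = dot / tgt_energy
    s_target = [scale * t for t in tgt]

    # Whatever remains after the projection is treated as noise.
    e_noise = [e - s for e, s in zip(est, s_target)]

    target_power = sum(s * s for s in s_target)
    noise_power = sum(v * v for v in e_noise)
    return 10 * math.log10((target_power + eps) / (noise_power + eps))
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the enhanced signal leaves the score essentially unchanged, which is why the metric is popular for separation and enhancement benchmarks.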

S2t Small Mustc En Es St

A speech-to-text transformer model for end-to-end English to Spanish speech translation

Speech Recognition

Transformers, Supports Multiple Languages

Convtasnet Libri3Mix Sepnoisy 8k

A ConvTasNet model trained with the Asteroid framework, designed to separate 3 independent audio sources from mixed audio and optimized for noisy speech data at an 8 kHz sampling rate.

Sound Separation

© 2025 AIbase